This section provides a review of XML features and conventions for quick reference.
The parts of an XML document
An XML document consists of the following parts, in this order:
1. An XML declaration (optional, but highly recommended)
2. A DOCTYPE declaration and DTD (optional), including comments, processing instructions, and entity references
3. XML elements (and their attributes), comments, processing instructions, and entity references
XML declaration
The XML declaration, if included, must be the first line in an XML document. It indicates the version of XML that the document adheres to, and whether the file includes any references to other files. For example:
<?xml version="1.0" standalone="no"?>
DOCTYPE declaration (including DTD)
The DOCTYPE declaration, which specifies the document's DTD, goes after the XML declaration and before the opening tag of the root element. There are two potential parts to any DTD: the external subset and the internal subset. If a document has only an external subset, it looks like this:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE rootElement SYSTEM "URL of DTD file">
If a document has only an internal subset, it looks like this:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE rootElement [ internal subset declarations ]>
If a document has both an external subset and an internal subset, it looks like this:
<?xml version="1.0" standalone="no"?>
<!DOCTYPE rootElement SYSTEM "URL of DTD file" [ internal subset declarations ]>
Elements
An element consists of an opening tag (<tagName>), some content, and a closing tag (</tagName>):
<tagName>Content goes here.</tagName>
An exception is the empty tag, which may be a single tag with a forward slash before the closing >:
<emptyTag/>
All elements must be properly nested, meaning that the most recently opened tag must be closed before you can close any other tags. For example, the following line would be illegal in an XML document because it does not close <tag2> before closing <tag1>:
<tag1><tag2>Content goes here.</tag1></tag2>
Each XML document must have a root element that contains all the other elements in the document.
Element names are case-sensitive. Each element name must begin with a letter or an underscore (_); subsequent characters in the name may be letters, underscores, numbers, hyphens and periods, but not spaces or tabs.
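The nesting, root-element, and case-sensitivity rules above can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed is an assumption made for this illustration and is not part of avenue.quark.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Properly nested: tag2 is closed before tag1.
print(is_well_formed("<tag1><tag2>Content goes here.</tag2></tag1>"))
# Improperly nested: tag1 is closed while tag2 is still open.
print(is_well_formed("<tag1><tag2>Content goes here.</tag1></tag2>"))
# Element names are case-sensitive, so <Tag> is not closed by </tag>.
print(is_well_formed("<Tag>Content goes here.</tag>"))
# A document needs a single root element.
print(is_well_formed("<first/><second/>"))
```

Only the first call is expected to report a well-formed document; the other three break one of the rules described above.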
Attributes
Elements may have attributes as part of their opening tag (or, for empty elements, as part of the single opening/closing tag). An attribute consists of an attribute name followed by an equals sign and then an attribute value in quotation marks. For example:
<elementName attributeName="attributeValue">Content</elementName>
Comments
A comment consists of text between a <!-- and a -->. The content of comments should be ignored by XML processors. Comments may not contain "--" and they may not contain other comments.
<!-- This is a comment. Characters such as < and > are legal here. -->
Processing instructions
A processing instruction consists of text between a <? and a ?>. Processing instructions are read only by XML processors and may not contain content. The syntax for processing instructions is as follows:
<?target instruction?>
Character references
A character reference is a way of representing Unicode characters in parsed character data. The syntax for character references is as follows:
&#UnicodeValueOfCharacter;
Entity references
An entity reference is a name that represents a specific character, text string, or file. Entity references in an XML document are always between an ampersand (&) and semicolon (;). For example, &lt; represents a less-than sign (<), which may not be included in XML content except as an entity reference.
The meaning of each entity reference used in an XML document must be defined in the document's DTD, with the exception of the following predefined character entity references, which may be used without being defined:
Character | Entity reference |
< | &lt; |
> | &gt; |
& | &amp; |
" | &quot; |
' | &apos; |
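As an illustration of the predefined entity references and of character references, here is a short sketch using Python's standard library. The helper name to_xml_text and the sample strings are assumptions made for this example.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# escape() handles &, < and > by default; the quotation marks are added here.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}

def to_xml_text(raw):
    """Replace the five special characters with predefined entity references."""
    return escape(raw, QUOTE_ENTITIES)

encoded = to_xml_text('AT&T says "1 < 2"')
print(encoded)

# A parser turns the entity references back into the original characters.
decoded = ET.fromstring("<p>" + encoded + "</p>").text
print(decoded)

# Character references: decimal 8482 and hexadecimal 2122 are the same
# Unicode value (the trademark symbol).
trademarks = ET.fromstring("<p>&#8482;&#x2122;</p>").text
print(ascii(trademarks))
```

The parser resolves both kinds of reference without any DTD, because the five entity references are predefined and character references need no declaration.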
Well-formed XML
To be well-formed, an XML document must follow these rules:
Valid XML
A valid XML document is an XML document that is well-formed and adheres to the DTD specified by its DOCTYPE declaration.
This section provides a review of DTD features and conventions for quick reference.
The parts of a DTD
A DTD may be composed of the following parts, in no particular order:
Element type declarations
The syntax for an element type definition is as follows:
<!ELEMENT elementName (elementContent)>
Element names are case-sensitive. Each element name must begin with a letter or an underscore (_); subsequent characters in the name may be letters, underscores, numbers, hyphens, and periods, but not spaces or tabs.
Element content may consist of parsed character data (that is, text and entity references, expressed as #PCDATA) and/or other element types. The following symbols may be inserted after any element name or closing parenthesis in the element content definition:
Symbol | Meaning |
None | Exactly one |
+ | One or more |
* | Zero or more |
? | Zero or one |
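The occurrence symbols in the table above, together with the comma (sequence) and | (choice) separators used in element content definitions, behave much like a regular expression over the names of an element's children. The following Python sketch makes that analogy concrete. It is an illustration only, not how a validating parser is implemented; the function names are hypothetical, and it handles element-only content (no #PCDATA, ANY, or EMPTY).

```python
import re

# Simplified element name: letter or underscore, then letters, digits,
# underscores, periods, or hyphens.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")

def content_model_to_regex(model):
    """Translate an element-only content model into a compiled regex."""
    compact = re.sub(r"\s+", "", model)
    # Each element name becomes a group that matches "name ".
    pattern = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + " )", compact)
    # A comma means "followed by", which in a regex is plain concatenation.
    return re.compile(pattern.replace(",", ""))

def children_match(model, children):
    """Check a list of child element names against a content model."""
    sequence = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(sequence) is not None

print(children_match("(title, para+, note?)", ["title", "para", "para"]))
print(children_match("(title, para+, note?)", ["title"]))
print(children_match("(element1 | element2)", ["element2"]))
print(children_match("(element1 | element2)", ["element1", "element2"]))
```

The second call fails because para+ requires at least one para, and the fourth fails because a choice allows only one of the two elements.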
To require one element to be followed by another, use a comma:
<!ELEMENT elementName (element1, element2)>
To indicate that content can include one element or another, use a |:
<!ELEMENT elementName (element1 | element2)>
To allow an element to contain a combination of specific elements and #PCDATA in any order, use the following syntax:
<!ELEMENT elementName (#PCDATA | element1 | element2)*>
To allow an element to contain any combination of elements and #PCDATA in any order, use the following syntax (note omission of parentheses):
<!ELEMENT elementName ANY>
To define an empty element, use the following syntax (note omission of parentheses):
<!ELEMENT elementName EMPTY>
Attribute declarations
The syntax for a single attribute definition is as follows:
<!ATTLIST elementName attributeName attributeType defaultValue>
Attribute names are case-sensitive. Each attribute name must begin with a letter or an underscore (_); subsequent characters in the name may be letters, underscores, numbers, hyphens, and periods, but not spaces or tabs.
Attribute types may be as follows:
Attribute type | Meaning |
CDATA | Character data and entity references, between quotation marks ("") |
ID | Must contain a unique name* for each element of this type |
IDREF | The unique ID name* of an element in the XML file |
ENTITY | An unparsed external entity reference name* defined in the DTD |
ENTITIES | A list of ENTITY names, separated by spaces |
Enumerated | A list of names*, separated by | characters, in parentheses |
NMTOKEN | A value containing only NameChar characters** |
NMTOKENS | A list of NMTOKENs, separated by spaces |
NOTATION | The name of a notation defined in the DTD |
Enumerated NOTATION | A list of NOTATIONs, separated by | characters, in parentheses |
*Names must begin with a letter or an underscore (_); subsequent characters in the name may be letters, underscores, numbers, hyphens, and periods, but not spaces or tabs.
**NameChar characters include letters, underscores, numbers, hyphens, or periods, but not spaces or tabs.
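The difference between a name and an NMTOKEN, as described in the two notes above, can be sketched with a pair of regular expressions in Python. These patterns follow the simplified wording of this section (ASCII letters and digits only); the function names are assumptions made for the example.

```python
import re

# A name must start with a letter or underscore.
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*")
# An NMTOKEN may start with any NameChar character.
NMTOKEN_RE = re.compile(r"[A-Za-z0-9_.\-]+")

def is_name(value):
    return NAME_RE.fullmatch(value) is not None

def is_nmtoken(value):
    return NMTOKEN_RE.fullmatch(value) is not None

print(is_name("chapter-1"))       # starts with a letter
print(is_name("1chapter"))        # starts with a digit, so not a name
print(is_nmtoken("1chapter"))     # but acceptable as an NMTOKEN
print(is_nmtoken("two words"))    # spaces are never allowed
```

In practice this means an ID or IDREF value such as "1chapter" would be rejected by a validating parser, while the same value is fine for an NMTOKEN attribute.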
Default attribute values may be as follows:
Default value | Meaning |
#REQUIRED | This attribute must be specified by the element |
#IMPLIED | This attribute may or may not be used |
#FIXED value | If not specified, this attribute is assumed to be value; if specified, it must be value |
defaultValue | If not specified, this attribute is assumed to be defaultValue |
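The effect of default values can be seen with a small sketch in Python. It relies on the standard xml.etree.ElementTree parser, which reads attribute declarations in an internal subset and supplies declared defaults, although it does not validate. The note element and its attributes are invented for this example.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0" standalone="yes"?>
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
  <!ATTLIST note priority CDATA "normal">
  <!ATTLIST note format CDATA #FIXED "text">
  <!ATTLIST note author CDATA #IMPLIED>
]>
<note>Remember the milk.</note>"""

root = ET.fromstring(DOC)

# priority was not specified, so the declared default is supplied.
print(root.get("priority"))
# format is #FIXED, so its fixed value is supplied as well.
print(root.get("format"))
# author is #IMPLIED: no value is supplied when it is left out.
print(root.get("author"))
```

A #REQUIRED attribute is not shown because a non-validating parser does not report its absence; checking it requires a validating parser.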
Comments
A comment consists of text between a <!-- and a -->. The content of comments should be ignored by XML processors. Comments may not contain "--" and they may not contain other comments.
<!-- This is a comment. Characters such as < and > are legal here. -->
Character references
A character reference is a way of representing Unicode characters in parsed character data. The syntax for character references is as follows:
&#UnicodeValueOfCharacter;
Entity reference declarations
There are five types of entities. The syntax for their declaration is as follows:
Type | Syntax |
Parsed internal | <!ENTITY entityName "text of entity"> |
Parsed external | <!ENTITY entityName SYSTEM "URL of file"> OR <!ENTITY entityName PUBLIC "name of file" "URL of file"> |
Unparsed external | <!ENTITY entityName SYSTEM "URL of file" NDATA notationName> OR <!ENTITY entityName PUBLIC "name of file" "URL of file" NDATA notationName> |
Internal parameter | <!ENTITY % entityName "text of entity"> |
External parameter | <!ENTITY % entityName SYSTEM "URL of file"> OR <!ENTITY % entityName PUBLIC "name of file" "URL of file"> |
The syntax for using the first three types of entity reference is &entityName;. The syntax for using a parameter entity is %entityName;. Parameter entity references are always parsed and may be used only in a DTD.
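As a brief illustration of a parsed internal entity, the Python sketch below declares one in an internal subset and references it in the document content. The entity name company and its text are invented for this example, and only the internal case is shown.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE memo [
  <!ENTITY company "Example Press">
]>
<memo>Published by &company;.</memo>"""

# The parser replaces &company; with the text from the declaration.
root = ET.fromstring(DOC)
print(root.text)

# Referencing an entity that is neither declared nor predefined is an error.
try:
    ET.fromstring("<memo>Published by &publisher;.</memo>")
    undefined_rejected = False
except ET.ParseError:
    undefined_rejected = True
print(undefined_rejected)
```

The second parse fails because publisher is not declared anywhere, which matches the rule that every entity reference other than the five predefined ones must be defined in the DTD.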
Notation declarations
Notation declarations should be specified in one of the two following ways:
<!NOTATION notationName SYSTEM "External Identifier">
The external identifier should be the name of an application that can process or display files to which this notation is applied. For example:
<!NOTATION gif SYSTEM "Microsoft Internet Explorer">
Note that it is up to the application that processes the XML to pass the URL to the application indicated by the external identifier.
Processing instructions
A processing instruction consists of text between a <? and a ?>. Processing instructions are read only by XML processors and may not contain content. The syntax for processing instructions is as follows:
<?target instruction?>
Let's say you've just exported an XML file from avenue.quark, and when you go to look at it in your text editor, you see a lower-case "a" with an accent wherever you thought you had a trademark symbol. In fact, a lot of your special symbols look wrong. What happened?
More than likely, your text editor doesn't support the encoding used by your XML file. This section explains the topic in detail.
What is an encoding?
An encoding is a specification that maps a set of characters to corresponding numeric values. For example, the ASCII encoding maps the character "M" to the numeric value 77, "N" to 78, "O" to 79, and so forth.
A text file's encoding allows a program to translate the text file into the proper characters on the screen. Without the encoding, a text file is just a stream of numbers. If you view a text file using the wrong encoding, you're likely to see garbage, because the application opening the file will map the numeric values to the wrong set of characters.
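The following Python sketch shows both halves of this idea: characters as numeric values, and the garbage that results when the numbers are read with the wrong encoding. It assumes the file was saved as UTF-8 and opened as Windows Latin 1 (Python codec name cp1252), which is one plausible way to get the accented "a" described earlier.

```python
# Characters map to numeric values.
print(ord("M"), ord("N"), ord("O"))

# The trademark symbol (U+2122) saved as UTF-8 is a run of three bytes.
raw = "\u2122".encode("utf-8")
print(list(raw))

# Reading those three bytes as Windows Latin 1 produces three unrelated
# characters, the first of which is an "a" with a circumflex accent.
wrong = raw.decode("cp1252")
print(ascii(wrong))

# Reading them with the correct encoding restores the trademark symbol.
right = raw.decode("utf-8")
print(ascii(right))
```

The bytes never change; only the mapping applied to them does.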
All of the following are encodings:
Avenue.quark supports the UTF-8, UTF-16, and Shift-JIS encodings.
Lower and upper character ranges
You can divide most encodings into two parts: the first 128 characters (the lower range), and all of the characters after that (the upper range).
Generally speaking, the lower range of most encodings is mapped to the same characters. This range includes the characters a-z, A-Z, 0-9, a handful of punctuation characters, plus some special control characters.
It's when you get into the upper range that you run into trouble. For example, MacRoman and Windows Latin 1 have lower ranges that are nearly identical. So if you take a file that uses only characters from this range and transfer that file from Mac OS to Windows, it looks fine. But if the file contains upper-range characters, you might get some strange results, because many of the upper-range values are mapped to different characters on each platform. For example, a character that shows up as a trademark symbol in Mac OS might show up as a superscript lower-case A in Windows.
When you get such incorrect character displays, it's either because the application displaying the text doesn't know the encoding of that text, or because the application isn't capable of correctly displaying text with the file's specified encoding.
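The trademark example can be reproduced in a few lines of Python, using the codec names mac_roman for MacRoman and cp1252 for Windows Latin 1. This is a sketch of the mismatch itself, not of anything avenue.quark does internally.

```python
# Lower-range characters have the same values in both encodings.
lower_mac = "Hello, XML 1.0".encode("mac_roman")
lower_win = "Hello, XML 1.0".encode("cp1252")
print(lower_mac == lower_win)

# The trademark symbol (U+2122) is a single upper-range byte in MacRoman.
tm_mac = "\u2122".encode("mac_roman")
print(tm_mac)

# The same byte read as Windows Latin 1 is U+00AA, a superscript "a".
as_windows = tm_mac.decode("cp1252")
print(ascii(as_windows))
```

The byte value is identical on both platforms; the two encodings simply assign it to different characters.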
Specifying encodings
You can indicate the encoding of an XML file by including an encoding specification in the file's XML declaration, like so:
<?xml version="1.0" encoding="Shift_JIS" standalone="yes"?>
If an XML file doesn't contain an encoding specification, avenue.quark assumes that the file uses the UTF-8 encoding.
When you save an XML file from avenue.quark, you specify the document's encoding using the Encoding pop-up menu, and avenue.quark automatically generates the appropriate encoding attribute.
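To see why the encoding specification matters, the Python sketch below parses the same upper-range character with and without a declaration. It uses ISO-8859-1 rather than Shift-JIS only because Python's built-in parser handles ISO-8859-1 directly; the element name and sample text are invented for this example.

```python
import xml.etree.ElementTree as ET

# The byte E9 is an accented "e" in ISO-8859-1, and the declaration says so.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>Caf\xe9</name>'
print(ascii(ET.fromstring(declared).text))

# With no encoding specification, the parser assumes UTF-8, so the same
# character must be stored as UTF-8 bytes.
utf8_default = "<name>Caf\u00e9</name>".encode("utf-8")
print(ascii(ET.fromstring(utf8_default).text))

# The ISO-8859-1 bytes without a declaration are not valid UTF-8.
try:
    ET.fromstring(b"<name>Caf\xe9</name>")
    undeclared_rejected = False
except ET.ParseError:
    undeclared_rejected = True
print(undeclared_rejected)
```

The last case is the situation to avoid: the file is readable only by a program that guesses the encoding correctly.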
Encodings and DTDs
XML lets you specify the encoding of an XML file. However, it doesn't provide a way to specify the encoding of a free-standing DTD file.
Fortunately, avenue.quark does. To specify the encoding of a free-standing DTD, just add the following text as the first line in the file:
<?xml encoding="encodingName" ?>
For example, to specify a free-standing DTD as a UTF-16 DTD, just add the following line to the beginning of the file:
<?xml encoding="UTF-16" ?>